{r global_options, include=FALSE}
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
echo=FALSE,warning=FALSE, message=FALSE)
This report explores a dataset containing over 100,000 loans from the lending company Prosper.
## [1] 113937 81
Although there are 81 fields, I want to concern myself with only a limited number of these for my analysis. The fields I chose to focus on were:
keep = c('CreditGrade',
'ListingCreationDate',
'Term',
'BorrowerAPR',
'ProsperRating..numeric.',
'ProsperScore',
'ListingCategory..numeric.',
'CreditScoreRangeLower',
'CreditScoreRangeUpper',
'StatedMonthlyIncome',
'LoanOriginalAmount',
'Investors')
data = data[keep]
str(data)
## 'data.frame': 113937 obs. of 12 variables:
## $ CreditGrade : Factor w/ 8 levels "A","AA","B","C",..: 4 NA 7 NA NA NA NA NA NA NA ...
## $ ListingCreationDate : Factor w/ 113064 levels "2005-11-09 20:44:28.847000000",..: 14184 111894 6429 64760 85967 100310 72556 74019 97834 97834 ...
## $ Term : int 36 36 36 36 36 60 36 36 36 36 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ProsperRating..numeric. : int NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperScore : num NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory..numeric.: int 0 2 0 16 2 1 1 2 7 7 ...
## $ CreditScoreRangeLower : int 640 680 480 800 680 740 680 700 820 820 ...
## $ CreditScoreRangeUpper : int 659 699 499 819 699 759 699 719 839 839 ...
## $ StatedMonthlyIncome : num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
Next I want to do some basic data wrangling to organize my data.
# Want Credit Grade to be an ordered Factor
data$CreditGrade = factor(data$CreditGrade,
ordered=T,
levels=c("AA", "A", "B", "C",
"D", "E", "HR", "NC"))
# Creation date
data$ListingCreationDate = as.Date(data$ListingCreationDate)
data$CreationMonth = as.Date(cut(data$ListingCreationDate, breaks='month'))
p_levels = unique(sort(format(as.Date(data$ListingCreationDate), '%m')))
data$NumericMonth = factor(format(as.Date(data$ListingCreationDate),
'%m'),
ordered=TRUE,
levels=p_levels)
# Term is probably more accurately reflected as an ordered factor as well
data$Term = factor(data$Term, ordered=TRUE, levels=c(12, 36, 60))
#Renaming ProsperRating and setting as a factor
names(data)[names(data) == 'ProsperRating..numeric.'] = 'ProsperRating'
data$ProsperRating = factor(data$ProsperRating, ordered=TRUE, levels=seq(1, 7, 1))
# Ordered factor for Prosper Score
data$ProsperScore = factor(data$ProsperScore, ordered=TRUE, levels=seq(1, 11, 1))
# Rename listing category and map to actual concepts
names(data)[names(data) == 'ListingCategory..numeric.'] = 'ListingCategory'
p_from = seq(0, 20, 1)
# this is frustrating, but I'm not sure there's a clearer way
p_to = c('Not Available',
'Debt Consolidation',
'Home Improvement',
'Business',
'Personal Loan',
'Student Use',
'Auto',
'Other',
'Baby&Adoption',
'Boat',
'Cosmetic Procedure',
'Engagement Ring',
'Green Loans',
'Household Expenses',
'Large Purchases',
'Medical/Dental',
'Motorcycle',
'RV',
'Taxes',
'Vacation',
'Wedding Loans')
data$ListingCategory = factor(mapvalues(data$ListingCategory,
from = p_from,
to = p_to),
levels=p_to)
data$CreditScore = (data$CreditScoreRangeUpper + data$CreditScoreRangeLower) / 2
data$CreditScoreRangeUpper = NULL
data$CreditScoreRangeLower = NULL
Great, so what does our data look like now?
str(data)
## 'data.frame': 113937 obs. of 13 variables:
## $ CreditGrade : Ord.factor w/ 8 levels "AA"<"A"<"B"<"C"<..: 4 NA 7 NA NA NA NA NA NA NA ...
## $ ListingCreationDate: Date, format: "2007-08-26" "2014-02-27" ...
## $ Term : Ord.factor w/ 3 levels "12"<"36"<"60": 2 2 2 2 2 3 2 2 2 2 ...
## $ BorrowerAPR : num 0.165 0.12 0.283 0.125 0.246 ...
## $ ProsperRating : Ord.factor w/ 7 levels "1"<"2"<"3"<"4"<..: NA 6 NA 6 3 5 2 4 7 7 ...
## $ ProsperScore : Ord.factor w/ 11 levels "1"<"2"<"3"<"4"<..: NA 7 NA 9 4 10 2 4 9 11 ...
## $ ListingCategory : Factor w/ 21 levels "Not Available",..: 1 3 1 17 3 2 2 3 8 8 ...
## $ StatedMonthlyIncome: num 3083 6125 2083 2875 9583 ...
## $ LoanOriginalAmount : int 9425 10000 3001 10000 15000 15000 3000 10000 10000 10000 ...
## $ Investors : int 258 1 41 158 20 1 1 1 1 1 ...
## $ CreationMonth : Date, format: "2007-08-01" "2014-02-01" ...
## $ NumericMonth : Ord.factor w/ 12 levels "01"<"02"<"03"<..: 8 2 1 10 9 12 4 5 12 12 ...
## $ CreditScore : num 650 690 490 810 690 ...
summary(data)
## CreditGrade ListingCreationDate Term BorrowerAPR
## C : 5649 Min. :2005-11-09 12: 1614 Min. :0.00653
## D : 5153 1st Qu.:2008-09-19 36:87778 1st Qu.:0.15629
## B : 4389 Median :2012-06-16 60:24545 Median :0.20976
## AA : 3509 Mean :2011-07-08 Mean :0.21883
## HR : 3508 3rd Qu.:2013-09-09 3rd Qu.:0.28381
## (Other): 6745 Max. :2014-03-10 Max. :0.51229
## NA's :84984 NA's :25
## ProsperRating ProsperScore ListingCategory
## 4 :18345 4 :12595 Debt Consolidation:58308
## 5 :15581 6 :12278 Not Available :16965
## 6 :14551 8 :12053 Other :10494
## 3 :14274 7 :10597 Home Improvement : 7433
## 2 : 9795 5 : 9813 Business : 7189
## (Other):12307 (Other):27517 Auto : 2572
## NA's :29084 NA's :29084 (Other) :10976
## StatedMonthlyIncome LoanOriginalAmount Investors
## Min. : 0 Min. : 1000 Min. : 1.00
## 1st Qu.: 3200 1st Qu.: 4000 1st Qu.: 2.00
## Median : 4667 Median : 6500 Median : 44.00
## Mean : 5608 Mean : 8337 Mean : 80.48
## 3rd Qu.: 6825 3rd Qu.:12000 3rd Qu.: 115.00
## Max. :1750003 Max. :35000 Max. :1189.00
##
## CreationMonth NumericMonth CreditScore
## Min. :2005-11-01 01 :11214 Min. : 9.5
## 1st Qu.:2008-09-01 10 :10539 1st Qu.:669.5
## Median :2012-06-01 12 :10320 Median :689.5
## Mean :2011-06-24 02 :10124 Mean :695.1
## 3rd Qu.:2013-09-01 09 :10074 3rd Qu.:729.5
## Max. :2014-03-01 11 : 9952 Max. :889.5
## (Other):51714 NA's :591
ggplot(aes(x=Investors), data=data) +
geom_histogram(boundary=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Intuitively, having a number of investors already paired with a loan might encourage further investment. This indicates that the number of investors might follow a multiplicative process. I’ll take a log transformation to better grasp the distribution.
ggplot(aes(x=Investors), data=data) +
geom_histogram(boundary=0) +
scale_x_log10()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Interesting, so if you throw out the common case where there are zero investors, the distribution appears roughly normal or skewed normal. Definitely want to look at that peak on the far left though
sort(table(data$Investors), decreasing=TRUE)[1:5]
##
## 1 2 3 4 5
## 27814 1386 991 827 753
The peak at the beginning is one of the issues with taking logs of counts. When the # of investors is 1, the log of that value is 0 which screws with my distribution quite a bit.
Next I want to look into what credit grade is assigned to most of the listings.
ggplot(aes(x=CreditGrade), data=subset(data, !is.na(CreditGrade))) +
geom_bar()
It would appear that most listings fall int he middle with a ‘C’ grade, although there are a significant number in all grades.
When was Prosper’s platform most popular? Is there a time when most of the listings were created?
ggplot(aes(x=CreationMonth, group=1), data=data) +
geom_point(stat='count') +
geom_line(stat='count') +
scale_x_date(labels = date_format('%Y-%m'),
date_breaks = '6 month',
limits = c(as.Date('2005-6-1'), as.Date('2014-12-1'))) +
theme(axis.text.x = element_text(angle = 90, hjust = 1))
There appears a sharp drop off in early to mid 2008 indicating an impact from the recession. There is also a sharp drop off in early 2014, and it might be useful to read up on the company and examine the possible cause. I don’t see a clear monthly pattern, but I’ll check the distribution just in case.
ggplot(aes(x=NumericMonth), data=data) +
geom_bar()
Next we’ll check the frequency of different term lengths for the loans.
ggplot(aes(x=Term), data=data) + geom_bar()
It looks like the vast majority of loans are 36 month (3 year) loans, with very few being a 12 month loan. It might be interesting later to look at the relationship between loan term and the amount.
ggplot(aes(x=BorrowerAPR), data=data) + geom_histogram(boundary=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 25 rows containing non-finite values (stat_bin).
summary(data$BorrowerAPR)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.00653 0.15630 0.20980 0.21880 0.28380 0.51230 25
The APR distribution appears to have a couple of distinct peaks with a few clear outliers at the high and low end.
ggplot(aes(x=ProsperRating), data=subset(data, !is.na(ProsperRating))) +
geom_bar()
ggplot(aes(x=ProsperScore), data=subset(data, !is.na(ProsperScore))) +
geom_bar()
summary(data$ListingCategory)
## Not Available Debt Consolidation Home Improvement
## 16965 58308 7433
## Business Personal Loan Student Use
## 7189 2395 756
## Auto Other Baby&Adoption
## 2572 10494 199
## Boat Cosmetic Procedure Engagement Ring
## 85 91 217
## Green Loans Household Expenses Large Purchases
## 59 1996 876
## Medical/Dental Motorcycle RV
## 1522 304 52
## Taxes Vacation Wedding Loans
## 885 768 771
I want to look only at the most popular listing categories
sub_data = subset(data, !is.na(ListingCategory) &
ListingCategory != 'Other' &
ListingCategory != 'Not Available')
categories = names(sort(table(sub_data$ListingCategory), decreasing = T)[1:5])
ggplot(aes(x=ListingCategory),
data=subset(data, ListingCategory %in% categories)) +
geom_bar()
Clearly Debt Consolidation is by far the most popular use for Prosper.
summary(data$StatedMonthlyIncome)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 3200 4667 5608 6825 1750000
I would like to be the guy who makes 175K a month
ggplot(aes(x=StatedMonthlyIncome), data=data) +
geom_histogram(boundary=0)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Hmm, I going to go ahead an ignore the top 1% of values because that is cleary skewing the distribution.
ggplot(aes(x=StatedMonthlyIncome), data=data) +
geom_histogram(boundary=0) +
xlim(0, quantile(data$StatedMonthlyIncome, .99))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 1140 rows containing non-finite values (stat_bin).
It looks like either a log normal or exponential distribution. I’ll take a log transformation to try and get a better idea of the data
ggplot(aes(x=StatedMonthlyIncome), data=data) +
geom_histogram(binwidth = .1) +
xlim(0, quantile(data$StatedMonthlyIncome, .99)) +
scale_x_log10()
## Scale for 'x' is already present. Adding another scale for 'x', which
## will replace the existing scale.
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Removed 1394 rows containing non-finite values (stat_bin).
Now we’re looking roughly normal with a few clear outliers and a low standard deviation
I want to look into the high and low outliers that to see if there is anything weird going on.
head(data[order(data$StatedMonthlyIncome),])
## CreditGrade ListingCreationDate Term BorrowerAPR ProsperRating
## 79 A 2008-06-29 36 0.18454 <NA>
## 101 <NA> 2013-01-25 36 0.35356 1
## 109 C 2008-04-29 36 0.28320 <NA>
## 126 <NA> 2011-05-10 36 0.35643 2
## 171 <NA> 2012-04-14 36 0.28851 3
## 187 <NA> 2012-02-03 36 0.35797 1
## ProsperScore ListingCategory StatedMonthlyIncome LoanOriginalAmount
## 79 <NA> Business 0 4800
## 101 3 Household Expenses 0 4000
## 109 <NA> Debt Consolidation 0 14000
## 126 5 Other 0 5000
## 171 5 Other 0 10000
## 187 5 Home Improvement 0 4000
## Investors CreationMonth NumericMonth CreditScore
## 79 115 2008-06-01 06 729.5
## 101 73 2013-01-01 01 729.5
## 109 53 2008-04-01 04 649.5
## 126 92 2011-05-01 05 649.5
## 171 161 2012-04-01 04 769.5
## 187 45 2012-02-01 02 669.5
A large number of people list their monthly income as 0, which screws with the log distribution. After examining the orignal dataframe, it’s difficult to tell whether this data is a mistake or not, so we’ll just take it at face value.
tail(data[order(data$StatedMonthlyIncome),])
## CreditGrade ListingCreationDate Term BorrowerAPR ProsperRating
## 113271 <NA> 2013-02-20 36 0.22712 4
## 8067 <NA> 2011-04-02 36 0.35643 1
## 57134 <NA> 2011-03-24 36 0.35643 2
## 17412 <NA> 2011-04-08 36 0.35643 2
## 53015 <NA> 2013-01-04 12 0.25785 3
## 53168 <NA> 2012-02-20 36 0.35797 1
## ProsperScore ListingCategory StatedMonthlyIncome
## 113271 6 Other 394400.0
## 8067 4 Debt Consolidation 416666.7
## 57134 5 Home Improvement 466666.7
## 17412 5 Debt Consolidation 483333.3
## 53015 3 Business 618547.8
## 53168 4 Business 1750002.9
## LoanOriginalAmount Investors CreationMonth NumericMonth CreditScore
## 113271 2000 26 2013-02-01 02 689.5
## 8067 2000 10 2011-04-01 04 669.5
## 57134 4000 72 2011-03-01 03 689.5
## 17412 7500 128 2011-04-01 04 669.5
## 53015 4000 93 2013-01-01 01 729.5
## 53168 4000 2 2012-02-01 02 749.5
ggplot(aes(x=LoanOriginalAmount), data=data) +
geom_histogram(binwidth = 3000, boundary=0)
Not a clear distribution for loan amount, but it appears as though it could be normal. I’ll look at the distribution of the log value just to see how it looks.
ggplot(aes(x=LoanOriginalAmount), data=data) +
geom_histogram(binwidth = .1) +
scale_x_log10()
Original loan amount doesn’t really fit very neatly into a distribution.
CreditScore:
ggplot(aes(x=CreditScore), data=data) + geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 591 rows containing non-finite values (stat_bin).
Distribution appears roughly normal. It seems safe to assume that when the credit score is 0, that’s a case where the data is simply incomplete.
There are 113,937 observations with 81 different features, however I chose to for this analysis I made some basic transformations and chose to focus on 12 of them.
I chose to treat the variables Term, ProsperRating, and ProsperScore as ordered factors. The ordering is obvious as they are all whole numbers from 1 to the max level for the factor. ListingCategory is an unordered factor with 21 levels.
Other observations:
The main features of interest are BorrowerAPR and factors that intuitively correlate with risk, with the core one being CreditScore.
I think that ProsperRating, ProsperScore, and CreditGrade are all important factors in determining BorrowerAPR. I cannot say with certainty which will be most influential.
I averaged the high credit score and the low credit score in order to create a single numeric credit score. I also found the month for each listing creation date to look for any patterns.
A number of the features were clearly exponential distributions. When analyzing relationships, it is generally more informative when a variable is normally distributed. For these variables I looked at the log of the distribution. Eg:
This is pretty noisy so I’ll look into relationships with individual plots.
First I want to look more closely at the impact of CreditGrade. For these plots I’ll exclude ‘NA’ and ‘NC’ grades, considering them incomplete data.
credit_grade_idx = (data$CreditGrade != 'NC') & (!is.na(data$CreditGrade))
ggplot(aes(x=CreditGrade, y=BorrowerAPR), data=data[credit_grade_idx, ]) +
geom_boxplot()
## Warning: Removed 24 rows containing non-finite values (stat_boxplot).
As expected, loans with better credit grades resulted in a lower rate for the borrower. However, for all cases there are borrowers that represent significant outliers. I will look into these outlying observations later in the analysis.
I also want to see if there is a relationship between the Term of the loan and the rate borrowers pay. Intuitively you might expect borrowers with longer loan periods to pay a lower rate.
ggplot(aes(x=Term, y=BorrowerAPR), data=data) +
geom_boxplot()
## Warning: Removed 25 rows containing non-finite values (stat_boxplot).
It’s interesting we see almost no difference in APR between different term periods. However, it is worth noting that the 60 month loans have a much lower spread than the shorter loans. For concrete numbers here are the standard deviations:
apr_idx = !is.na(data$BorrowerAPR)
by(data[apr_idx,]$BorrowerAPR, data[apr_idx,]$Term, sd)
## data[apr_idx, ]$Term: 12
## [1] 0.09023572
## --------------------------------------------------------
## data[apr_idx, ]$Term: 36
## [1] 0.08548865
## --------------------------------------------------------
## data[apr_idx, ]$Term: 60
## [1] 0.05747886
And interquartile ranges:
by(data[apr_idx,]$BorrowerAPR, data[apr_idx,]$Term, IQR)
## data[apr_idx, ]$Term: 12
## [1] 0.144505
## --------------------------------------------------------
## data[apr_idx, ]$Term: 36
## [1] 0.14408
## --------------------------------------------------------
## data[apr_idx, ]$Term: 60
## [1] 0.08534
Prosper Rating:
rating_idx = !is.na(data$ProsperRating)
ggplot(aes(x=ProsperRating, y=BorrowerAPR), data=data[rating_idx,]) +
geom_boxplot()
As you might expect, higher ratings pay a lower APR and the differences appear pretty significant. What is surprising is the huge number of outliers, especially for the lowest rating.
ProsperScore:
score_idx = !is.na(data$ProsperScore)
ggplot(aes(x=ProsperScore, y=BorrowerAPR), data=data[score_idx, ]) +
geom_boxplot()
The general pattern is consistent with what you might expect, but there is quite a bit more variability than with ProsperRating. Clearly many investors have different feelings about the level of risk assigned to loans than Prosper.
ListingCategory:
listing_idx = (!is.na(data$ListingCategory) &
(data$ListingCategory != 'Not Available'))
ggplot(aes(x=ListingCategory, y=BorrowerAPR), data=data[listing_idx, ]) +
geom_boxplot() +
theme(axis.text.x = element_text(angle = -90, hjust = 0, vjust = 0))
In general, the listing category doesn’t seem to have a big impact on the apr the borrower pays. What interesting however are the two apparrent exceptions with personal loans paying a lower apr and cosmetic procedures paying a higher apr. Perhaps this says something about the biases of the lenders.
StatedMonthlyIncome:
Next we’ll look at a continuous variable.
income_idx = !is.na(data$StatedMonthlyIncome)
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR),
data=data[income_idx, ]) +
geom_point()
## Warning: Removed 25 rows containing missing values (geom_point).
Recall from the histogram of income that a few very high values skew the distribution. I’ll remove those to get a better idea of the relationship.
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR),
data=data[income_idx, ]) +
geom_point() +
xlim(0, quantile(data$StatedMonthlyIncome, .99))
## Warning: Removed 1163 rows containing missing values (geom_point).
A lot of incomes and APRs clearly are set to neat values. I’ll jitter the points to help get a clearer picture as well as increase the transparency.
income_log_idx = data$StatedMonthlyIncome > 1
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR),
data=data[income_idx & income_log_idx, ]) +
geom_point(alpha=.01, position = 'jitter') +
xlim(0, quantile(data$StatedMonthlyIncome, .99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1163 rows containing non-finite values (stat_smooth).
## Warning: Removed 1163 rows containing missing values (geom_point).
There doesn’t appear to be any relationship at all between income and APR. If anything there is a slight negative relationship. This is surprising to me, but I suspect those with larger incomes also ask for larger loan amounts. I would also like to try a simple log transformation, as earlier that helped the income distribution appear more normal.
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR),
data=data[income_idx, ]) +
geom_point(alpha=.01, position = 'jitter') +
#xlim(1, quantile(data$StatedMonthlyIncome, .99)) +
scale_x_log10(limits=c(quantile(data$StatedMonthlyIncome, .05),
quantile(data$StatedMonthlyIncome, .99))) +
geom_smooth()
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## `geom_smooth()` using method = 'gam'
## Warning: Removed 6844 rows containing non-finite values (stat_smooth).
## Warning: Removed 5468 rows containing missing values (geom_point).
LoanOriginalAmount:
amount_idx = !is.na(data$LoanOriginalAmount)
ggplot(aes(x=LoanOriginalAmount, y=BorrowerAPR),
data=data[amount_idx, ]) +
geom_point()
## Warning: Removed 25 rows containing missing values (geom_point).
Again high density clusters of points and distinct verticals that are common. I’ll fade and jitter the points.
ggplot(aes(x=LoanOriginalAmount, y=BorrowerAPR),
data=data[amount_idx, ]) +
#geom_point(alpha=.01, position = 'jitter_x')
geom_jitter(alpha=.01, width=100) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 25 rows containing non-finite values (stat_smooth).
## Warning: Removed 25 rows containing missing values (geom_point).
There is a large amount of variability, but there does appear to be a slight negative relationship between loan amount and borrower APR. I suspect this is a result of wealthier borrowers taking out larger loans. We’ll check that assumption quickly.
ggplot(aes(x=LoanOriginalAmount, y=StatedMonthlyIncome),
data=data) +
geom_point(position='jitter', alpha=.01) +
ylim(0, quantile(data$StatedMonthlyIncome, .99)) +
geom_smooth()
## `geom_smooth()` using method = 'gam'
## Warning: Removed 1140 rows containing non-finite values (stat_smooth).
## Warning: Removed 1851 rows containing missing values (geom_point).
It’s not perfect, but there’s clearly a positionve linear relationship between income and loan amount which supports my theory.
Investors:
investor_idx = !is.na(data$Investors)
ggplot(aes(x=Investors, y=BorrowerAPR), data=data[investor_idx, ]) +
geom_point()
## Warning: Removed 25 rows containing missing values (geom_point).
In the univariate analysis it was clear that Investors follows an exponential relationship so I’ll take a log transformation. I also want to exclude cases with 0 investors.
ggplot(aes(x=Investors, y=BorrowerAPR), data=data[investor_idx, ]) +
geom_point(alpha=.05, position='jitter') +
scale_x_log10(limits=c(1, max(data$Investors)))
## Warning: Removed 13865 rows containing missing values (geom_point).
There isn’t really a clear picture about the relationship between the number of investors and the APR.
CreditScore: (Assuming anything under 300 is a mistake)
credit_score_idx = !is.na(data$CreditScore) & (data$CreditScore >= 300)
ggplot(aes(x=CreditScore, y=BorrowerAPR), data=data[credit_score_idx, ]) +
geom_point(position = 'jitter', alpha=.01) +
geom_smooth(method='lm')
As expected, there appears to be a negative linear relationship between credit score and APR.
Unsurprisingly, the different grading systems (CreditGrade, ProsperRating, ProsperScore) were all strong predictors of BorrowerAPR, with ProsperScore being the least reliable.
The length of the loan (Term) does not appear to impact the BorrowerAPR.
ListingCategory seems not to matter much with the notable exceptions of ‘Cosmetic Procedure’ and ‘Personal Loan’.
After removing outliers, there appears to be a weak negative linear relationship between StatedMonthlyIncome and BorrowerAPR.
There is also a weak negative linear relationship between LoanOriginalAmount and BorrowerAPR.
Also there is a clear negative lineral relationship between CreditScore and BorrowerAPR.
There is a strong positive relationship between StatedMonthlyIncome and LoanOriginalAmount so I suspect having both in the model would be a solid predictor of BorrowerAPR.
The relationship between CreditScore and BorrowerAPR. Intuitively, this makes a lot of sense.
I want to explore the obvious outliers in my BorrowerAPR vs CreditGrade boxplots. Why are some borrowers with good credit grades receiving high APRs and vice versa.
Including CreditScore doesn’t help us explain the CreditGrade APR outliers. It is clear from the clean color bands in the plot that CreditScore is almost entirely determinite of CreditGrade with the bands becoming a little fuzzier for the lower credit scores.
sub_data = data[income_idx & income_log_idx & credit_grade_idx,]
ggplot(aes(x=StatedMonthlyIncome, y=BorrowerAPR), data=sub_data) +
geom_point(position='jitter', alpha=1, aes(color=CreditGrade)) +
xlim(0, quantile(sub_data$StatedMonthlyIncome, .99)) +
scale_color_brewer(type='seq')
## Warning: Removed 303 rows containing missing values (geom_point).
There actually doesn’t appear to be a strong relationship between income and borrower APR at all. It might be more interesting to look at loan amount relative to income.
sub_data = data[income_idx & income_log_idx & credit_grade_idx,]
ggplot(aes(x=LoanOriginalAmount/StatedMonthlyIncome, y=BorrowerAPR),
data=sub_data) +
geom_point(position='jitter', alpha=1, aes(color=CreditGrade)) +
xlim(0,
quantile(sub_data$LoanOriginalAmount/sub_data$StatedMonthlyIncome, .99))+
scale_color_brewer(type='seq')
## Warning: Removed 303 rows containing missing values (geom_point).
wh, so this is a little more informative. There appears to be a positive linear relationship between this ratio and and APR. Let’s limit ourselves ot just high credit grades.
idx = data$CreditGrade %in% c('AA', 'A')
sub_data = data[income_idx & income_log_idx & credit_grade_idx & idx,]
ggplot(aes(x=LoanOriginalAmount/StatedMonthlyIncome, y=BorrowerAPR),
data=sub_data) +
geom_point(position='jitter', alpha=.3, aes(color=CreditGrade)) +
xlim(0,
quantile(sub_data$LoanOriginalAmount/sub_data$StatedMonthlyIncome, .99)) +
geom_smooth(method='lm')
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 80 rows containing missing values (geom_point).
It looks like the ratio of the loan to income helps to explain at least part of the reason borrowers with a high credit grade might be given a higher APR. This is especially true for high credit grade borrowers.
Let’s also see if time when the loan is given has an impact on APR.
ggplot(aes(x=CreationMonth, y=BorrowerAPR),
data=data[credit_grade_idx,]) +
geom_point(position='jitter', alpha=1, aes(color=CreditGrade)) +
scale_color_brewer(type='seq')
## Warning: Removed 24 rows containing missing values (geom_point).
There’s definitely a weird cap on higher APR rates for low credit grade borrowers during the midst of the recession.
It appears that a borrower’s credit grade is almost entirely a function of their averaged credit score. However, this doesn’t entirely explain some of the weird outliers. One possible explantion appeared to be the loan to income ratio.
It was surprising how perfectly credit grade maps to credit score. I would have thought Prosper used a more sophisticated mdoel.
I used a basic linear regression to model the relationship between APR and the loan amount to income ratio for high credit grade borrowers.
Although the listing category may not be the most informative feature for most loans, I though it was interesting because of the outliers. It’s clear that the lending market is not kind to those taking out loans for “cosmetic procedures”, but there is often times a discount for something as generic as “personal use”.
Unsurprisingly one of the best predictors of a borrower’s APR is their credit score. It just demonstrates that even on more modern lending platforms, it’s still important to carefully manage your credit.
## Warning: Removed 76 rows containing non-finite values (stat_smooth).
## Warning: Removed 82 rows containing missing values (geom_point).
This plot shows the impact of a borrower’s loan amount to income ratio on APR. It demonstrates that people who borrow a lot realtive to their monthly income can expect to pay a higher APR.
The Prosper dataset contains over 100,000 loans with 81 different features for each loan. In my project, I chose to focus on only a small subset of those features. First I plotted the distribution of each feature individually in order to get a feel for my data and look for potential outliers or oddities.
Next I looked into what factors determined the APR for an individual borrower. Unsurprisingly, I found the Prosper rating systems such as Credit Grade, Prosper Rating, and Prosper Score were all strong indicators of what APR a borrower might pay. Additionally, credit score was strongly correlated with APR. However, there were numerous outliers and I wanted to examine how these occurred. I found that borrowers with high credit grades who still borrowed a large amount relative to their monthly incomes were likely to pay a higher APR than you might otherwise expect.
There are obvious limitation with the conclusions I draw. First, I only looked at a very small subset of the total number of features. There could be much stronger correlations and patterns that I’m not seeing. Second, my final plots only looked at Credit Grade, which only exists up till 2009, so I’m looking at only a subset of my data. Given that there are so many variables that can affect APR, in the future I might look into outside economic indicators such as Fed interest rates to see if those cause some of the noise in my data.